PhysicsNemo Datapipes #1304
Conversation
Greptile Summary
Additional Comments (47)
- test/datapipes/core/test_readers.py, line 200 (syntax): Debug print statement should be removed.
- test/datapipes/core/transforms/test_subsample.py, lines 126-127 (syntax): Comment has a line break in the middle of a word.
- v2.0-MIGRATION-GUIDE.md, line 58 (syntax): Extra asterisk in 'DataPipes**' should be 'DataPipes'.
- physicsnemo/datapipes/core/__init__.py, line 37 (syntax): Inconsistent naming in example: uses `dp.Downsample` but exports `SubsamplePoints`.
- examples/minimal/datapipes/tutorial_04_hydra_config.py, line 54 (style): `torch` is imported but not used in this tutorial.
- test/datapipes/core/test_dataset.py, line 238 (style): Inconsistent device specification: uses `cuda:0` in line 138 but `cuda` here without a device index.
- examples/minimal/datapipes/generate_variable_points_data.py, line 298 (syntax): Backend choice inconsistency: example shows '-b npy' but valid choices are only 'npz' and 'zarr'.
- test/datapipes/core/test_collate.py, lines 227-228 (logic): Test function returns first sample without proper tuple unpacking; should return a (data, metadata) tuple for consistency.
- examples/minimal/datapipes/README.md, line 25 (syntax): Typo: 'ascynchronous' should be 'asynchronous'.
- examples/minimal/datapipes/README.md, line 30 (syntax): Typo: 'reproducability' should be 'reproducibility'.
- examples/minimal/datapipes/README.md, line 81 (syntax): Grammar: 'for understand' should be 'for understanding'.
- physicsnemo/datapipes/core/dataloader.py, line 56 (syntax): Import statement uses incorrect module name; should be `from physicsnemo.datapipes import ...` instead of `from datapipe import ...`.
- examples/minimal/datapipes/tutorial_02_transforms.py, line 292 (syntax): Grammar: 'it's' should be 'its' (possessive, not contraction).
- .github/CODEOWNERS, line 62 (style): Trailing whitespace after `@pzharrington` should be removed.
- examples/minimal/datapipes/tutorial_03_custom_gnn_datapipe.py, lines 241-242 (logic): Variable `n_edges` is misleading; it represents edges per node (the k value), not total edges in the graph.
- examples/minimal/datapipes/tutorial_03_custom_gnn_datapipe.py, lines 246-247 (logic): The calculation `n_edges / n_nodes` is incorrect for average degree; it should be `n_edges`, since `n_edges` already represents k neighbors per node.
- physicsnemo/datapipes/core/collate.py, line 169 (logic): Uses `torch.stack()` on TensorDict objects directly; should verify this is the intended API, as `TensorDict.stack()` might be the correct method. Should this be `TensorDict.stack(data_list, dim=self.stack_dim)` instead of `torch.stack()`?
- physicsnemo/datapipes/core/collate.py, line 353 (logic): The function signature indicates it returns a tuple, but it uses a `DefaultCollator` instance that may not collate metadata by default. Should the `_default_collator` be initialized with `collate_metadata=True` to match the return type annotation?
- test/datapipes/core/test_transforms.py, lines 517-612 (style): Large blocks of commented test code should be removed or converted to proper TODO issues rather than left in the codebase.
- test/datapipes/core/test_transforms.py, lines 619-659 (style): Another large block of commented test code that should be cleaned up.
- examples/minimal/datapipes/tutorial_01_getting_started.py, line 300 (logic): This line will cause a TypeError. `batch_data` is a TensorDict, not a tuple/list, so `batch_data[1]` is invalid. Remove this debug print statement or access a valid key.
- test/datapipes/core/test_transforms.py, lines 764-767 (style): Remove commented test code to keep the file clean.
- physicsnemo/datapipes/core/transforms/compose.py, lines 93-95 (style): Missing return type annotation for the `__iter__` method; should be `Iterator[Transform]`.
- physicsnemo/datapipes/core/readers/numpy.py, line 186 (style): This random sampling could cause reproducibility issues if no seed is set. Consider using a seeded random state or documenting the need for external seed management. Should this method accept a random state parameter or use a class-level random state to ensure reproducible subsampling?
- physicsnemo/datapipes/core/readers/numpy.py, line 252 (style): The `np.array(arr)` wrapping may create an unnecessary copy. Since `arr` is already a numpy array from npz, `torch.from_numpy(arr)` should work directly.
- physicsnemo/datapipes/core/transforms/geometric.py, lines 53-57 (syntax): Example uses undefined `Sample` class; should use `TensorDict`.
- physicsnemo/datapipes/core/transforms/geometric.py, lines 225-228 (syntax): Example references `TranslationInvariance`, but the class is named `Translate`.
- physicsnemo/datapipes/core/transforms/geometric.py, lines 282-287 (logic): Logic error: device assignment after type error could be unreachable.
- physicsnemo/datapipes/core/transforms/geometric.py, lines 305-308 (syntax): Example references `ScaleInvariance`, but the class is named `ReScale`.
- physicsnemo/datapipes/core/transforms/subsample.py, lines 147-150 (style): The example uses `Sample`, but the actual parameter type is `TensorDict`. This inconsistency could confuse users.
- physicsnemo/datapipes/core/transforms/subsample.py, line 244 (logic): Missing validation that the weights tensor has the same length as the data tensors being sampled.
- physicsnemo/datapipes/core/readers/tensorstore_zarr.py, line 34 (style): Uses `check_version_spec` instead of the recommended `check_min_version` from MOD-011. The coding standards specify using `check_min_version(package, version, hard_fail=False)` for optional dependency handling. (Context used: CODING_STANDARDS/MODELS_IMPLEMENTATION.md from greptile.json.)
- physicsnemo/datapipes/core/readers/tensorstore_zarr.py, lines 179-185 (style): The `_spec_template` is defined but never used in the implementation. Consider removing it or documenting its intended purpose.
- physicsnemo/datapipes/core/readers/tensorstore_zarr.py, line 272 (style): Field discovery is called twice (line 167 and here) for the same group, which is inefficient. Consider caching the result from the first discovery.
- physicsnemo/datapipes/core/transforms/spatial.py, lines 53-58 (logic): Example uses undefined `Sample` class instead of `TensorDict`. Should be consistent with the actual data structure used in the transforms.
- physicsnemo/datapipes/core/transforms/normalize.py, lines 285-291 (style): Device transfer modifies internal state during the forward pass, which could cause issues in multi-threaded environments or when the same transform is used across different devices simultaneously.
- physicsnemo/datapipes/core/transforms/normalize.py, line 197 (syntax): Return type annotation is imprecise; should be `dict[str, dict[str, torch.Tensor]]` to match the actual structure returned.
- physicsnemo/datapipes/core/transforms/normalize.py, lines 357-371 (logic): Inconsistent with the forward pass: the inverse method doesn't update internal state when transferring to device, which could cause device mismatch issues.
- physicsnemo/datapipes/core/transforms/spatial.py, lines 285-286 (logic): Potential indexing error when `k=1`. Slicing `[:, 1:]` on a tensor with shape `[M, 1, ...]` results in an empty tensor `[M, 0, ...]`; see the sketch after this list. Should there be validation that k > 1, or should the slicing logic handle the k=1 case differently?
- physicsnemo/datapipes/core/dataset.py, line 77 (syntax): Import path in docstring example uses `from datapipe import`, but should be `from physicsnemo.datapipes.core import` to match the actual module structure.
- physicsnemo/datapipes/core/readers/hdf5.py, line 128 (style): Opening an HDF5 file handle in the constructor without an explicit mode could lead to resource leaks if the constructor fails after this point.
- physicsnemo/datapipes/core/readers/vtk.py, line 32 (logic): Uses a deprecated function name. Should use `check_min_version` instead of `check_version_spec` according to MOD-011. (Context used: CODING_STANDARDS/MODELS_IMPLEMENTATION.md from greptile.json.)
- physicsnemo/datapipes/core/readers/vtk.py, line 165 (logic): Duplicate key `surface_normals` exists in both `_stl_keys` (line 161) and `_vtp_keys` (line 165). This could cause confusion when filtering keys.
- physicsnemo/datapipes/core/readers/vtk.py, lines 247-252 (logic): Inconsistent logic: `need_stl` defaults to True when `keys_to_read` is None, but `need_vtp` and `need_vtu` require explicit keys. This asymmetry may cause unexpected behavior.
- physicsnemo/datapipes/core/transforms/field_processing.py, lines 111-112 (logic): Scalar expansion logic may not handle multi-dimensional features correctly. A 1D feature with shape (5,) would be stacked incorrectly with scalars that get expanded to (1,).
- physicsnemo/datapipes/core/transforms/field_processing.py, line 120 (syntax): `broadcast_to(n_points, -1)` syntax is invalid. PyTorch requires an explicit target shape like `(n_points, fx.shape[-1])`.
- physicsnemo/datapipes/core/transforms/field_processing.py, lines 124-128 (style): `__repr__` is missing the `n_points_key` parameter in the representation string.
48 files reviewed, 47 comments
peterdsharpe left a comment:
Submitting initial comments so far; continuing with review...
```python
self._device: Optional[torch.device] = None

@abstractmethod
def __call__(self, data: TensorDict) -> TensorDict:
```
Are re-implementations allowed to have optional kwargs, or is it really just `data` allowed and that's it?
The design is really "just data and that's it". The call to all transforms is mostly automatic and transparent to the user, it happens in the dataset:
```python
# Apply transforms (data is now on target device if specified)
if self.transforms is not None:
    if stream is not None:
        with torch.cuda.stream(stream):
            data = self.transforms(data)
        # Record event for synchronization
        result.event = torch.cuda.Event()
        result.event.record(stream)
    else:
        data = self.transforms(data)
```
My intention is that configuration of transforms happens at initialization - and you can have multiple of the same kind with different arguments, that's fine. Is there some dynamic behavior we need that isn't supported?
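To illustrate the intended pattern, here is a minimal sketch (assuming the tensordict package; the `Scale` transform and its parameters are hypothetical, not part of this PR): configuration lives in `__init__`, and `__call__` takes only `data`.

```python
from abc import ABC, abstractmethod
from tensordict import TensorDict


class Transform(ABC):
    @abstractmethod
    def __call__(self, data: TensorDict) -> TensorDict: ...


class Scale(Transform):
    """Hypothetical transform: all configuration is captured at init time."""

    def __init__(self, key: str, factor: float) -> None:
        self.key = key
        self.factor = factor

    def __call__(self, data: TensorDict) -> TensorDict:
        data[self.key] = data[self.key] * self.factor
        return data


# Two instances of the same transform with different configuration:
scale_coords = Scale(key="coordinates", factor=2.0)
scale_fields = Scale(key="fields", factor=0.5)
```

Any dynamic behavior then belongs in the transform's own state rather than in extra `__call__` arguments.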
```
@@ -0,0 +1,447 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES.
```
Would these tutorials be perhaps clearer as Jupyter notebooks?
Seems like there's a lot of `print()` going on that might be better presented as a Markdown block. (Plus other benefits, like being able to convert this into a docs page later, etc.)
Sure! I didn't do that because we don't have many notebooks floating around in the examples section. But it is cleaner in a notebook, yeah.
How about as a marimo notebook/script? Best of both worlds :)
Co-authored-by: Peter Sharpe <[email protected]>
Alexey-Kamenev left a comment:
LGTM!
```
@@ -0,0 +1,114 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES.
```
We talked over Slack about the pros/cons of the ./datapipes/core/ vs ./datapipes/ dir structure -- either is fine with me and I'll leave this to your discretion, just putting this comment as a reminder.
Add more transform functionality; implement hydra registry.
```python
    ValueError: If a new key name already exists in the data.
    """
    # Check for missing keys if strict mode
    data_keys = set(str(k) for k in data.keys())
```
Is casting `str(k)` on a `k` from `data.keys()` type-safe in the case of nested TensorDicts? Or what if you want to rename keys in a nested TensorDict: wouldn't the input and output types be `tuple[str, ...] | str`, depending on whether they're nested?
Ah hmmm I didn't consider nested dictionaries here, only top level ones.
At the highest level of a dict, casting to str should be fine on all the keys. I am inclined to say "everything we build and ship in the core repo is single layer" while users are welcome to inherit and replace functionality as needed. The goal is to be extensible and configurable, not necessarily perfectly generic.
Fair enough! That said, the easy solution would be to just flatten and unflatten before/after:
```python
# `data` is a TensorDict, possibly with nested TensorDict keys
data_f = data.flatten_keys()  # now all keys are "parent.child.grandchild" strings
# <do all the transforms>
data = data_f.unflatten_keys()
```

which opens up the possibility of using nesting without errors at the cost of just 2 added lines of code.
I can imagine nested TensorDicts could be quite useful as we do more multiphysics simulations in the future - e.g., coupling a DoMINO fluids model with a MeshGraphNet structural model. But it's your call here if you want to implement it now or save it for later.
Done, I went with the flatten/unflatten route. I also added some tests on this. You're right that we're maybe going to want to implement nested dictionaries with complicated data pipes.
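For reference, a minimal sketch of that round trip (tensordict package assumed; the key names here are made up):

```python
import torch
from tensordict import TensorDict

data = TensorDict(
    {"fluid": {"pressure": torch.zeros(4, 1)}, "coords": torch.zeros(4, 3)},
    batch_size=[4],
)

flat = data.flatten_keys(separator=".")               # keys: "fluid.pressure", "coords"
flat = flat.rename_key_("fluid.pressure", "fluid.p")  # rename on the flat string keys
data = flat.unflatten_keys(separator=".")             # nested structure restored
print(data["fluid", "p"].shape)                       # torch.Size([4, 1])
```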
peterdsharpe left a comment:
Overall looks good, excited to get this in! Readers look great.
Some areas in the transforms should be fixed/addressed before we pull this in; might be worth considering making `Transform` a tensorclass?
Tutorials look great - the suggestion to convert to ipynb is there, but up to you if you want to do it just yet.
```python
    Raises:
        KeyError: If strict=True and a specified key is not found.
    """
    available_keys = set(str(k) for k in data.keys())
```
I think this is more robustly implemented with `TensorDict.select` and `TensorDict.exclude`, which work in the case of nested TensorDicts: https://docs.pytorch.org/tensordict/stable/tutorials/tensordict_keys.html#selecting-and-excluding-keys

I think this implementation breaks in the nested case due to the direct `.keys()` iteration?
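For illustration, a small sketch of those interfaces (tensor names made up); nested keys are addressed as tuples:

```python
import torch
from tensordict import TensorDict

td = TensorDict(
    {"coords": torch.zeros(4, 3), "fields": {"p": torch.zeros(4, 1)}},
    batch_size=[4],
)

kept = td.select("coords", ("fields", "p"))  # keep only the named keys
rest = td.exclude(("fields", "p"))           # drop a nested key, keep the rest
print(list(kept.keys(include_nested=True, leaves_only=True)))
# ['coords', ('fields', 'p')]
```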
```
@@ -0,0 +1,92 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES.
```
In general in the transforms in this folder:
Be careful about direct iteration over `tensordict.keys()` or `.items()` or `.values()`, as opposed to using TensorDict-specific interfaces. This:

- Can sometimes result in unintended behavior in the case of nested TensorDicts, as by default this will only get the top-level keys rather than all leaves.
- In the case where you do get the leaves (e.g., `.keys(include_nested=True)`), the resulting keys are not necessarily strings: they are `tuple[str, ...] | str`, depending on whether they're nested or not (see the demo after the link below).
- As an alternative to specifying whether nested keys are included, you can also use the `.flatten_keys()` and `.unflatten_keys()` interface to deal with this (but it may be less memory-efficient? Not sure).
Overall these transforms look good - just want to be sure they don't break in the case of nested keys.
More info on this:
https://docs.pytorch.org/tensordict/stable/tutorials/tensordict_keys.html#nested-values
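A quick illustration of the mixed key types (a sketch with made-up tensor names):

```python
import torch
from tensordict import TensorDict

td = TensorDict(
    {"coords": torch.zeros(2, 3), "fields": {"p": torch.zeros(2, 1)}},
    batch_size=[2],
)

print(list(td.keys()))
# ['coords', 'fields']  <- nested leaves are hidden at the top level

print(list(td.keys(include_nested=True, leaves_only=True)))
# ['coords', ('fields', 'p')]  <- a mix of str and tuple[str, ...] keys
```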
```python
    # Normalize: (x - mean) / std
    updates[key] = (tensor - mean) / (std + self.eps)

else:  # min_max
```
```diff
- else:  # min_max
+ elif self.method == "min_max":  # min_max
```

Helps readability and type checkers to explicitly check; can do an `else: raise ValueError` at the end (see the sketch below).
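A hedged sketch of the suggested pattern as a self-contained helper (the function and its signature are made up for illustration, not the PR's actual code):

```python
import torch


def normalize(tensor: torch.Tensor, method: str, eps: float = 1e-8) -> torch.Tensor:
    """Hypothetical helper showing the explicit-branch pattern."""
    if method == "mean_std":
        return (tensor - tensor.mean()) / (tensor.std() + eps)
    elif method == "min_max":
        return (tensor - tensor.min()) / (tensor.max() - tensor.min() + eps)
    else:
        # Unknown methods fail loudly instead of silently falling through.
        raise ValueError(f"Unknown normalization method: {method!r}")
```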
laserkelvin left a comment:
I've left some general comments, particularly on the write ups.
The main point I want to bring up in my review is that the unit tests for the readers in particular could use some negative cases, just to mark what is and isn't supported (e.g., deviations from the expected schema for each reader). I would also have preferred fuzz testing for data in general, rather than relying on tests against specific shapes and keys.
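To make the suggestion concrete, a sketch of one such negative case (hypothetical throughout: the `NumpyReader` constructor, its `keys_to_read` parameter, and the raised exception type are assumptions about this PR's API, not confirmed):

```python
import numpy as np
import pytest


def test_reader_rejects_missing_field(tmp_path):
    # Write a file that deviates from the expected schema:
    # it lacks the (hypothetical) required "coordinates" field.
    path = tmp_path / "bad_sample.npz"
    np.savez(path, fields=np.zeros((10, 3)))

    # Hypothetical import and constructor; the real reader may differ.
    from physicsnemo.datapipes.core.readers.numpy import NumpyReader

    with pytest.raises(KeyError):
        NumpyReader(path, keys_to_read=["coordinates"]).read(0)
```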
PhysicsNeMo Pull Request
Description
This PR implements PhysicsNeMo Datapipes as configurable pipelines with reusable components. It is meant to provide a foundation for training and inference pipelines, though full-scale replacement, updating, or deprecation of existing targeted pipelines is left for the future and not implemented here.
I will defer a description of the datapipe design to the README and tutorials, located in `examples/minimal/datapipes/`. The core components of the datapipe are all present:
Checklist
Dependencies
Review Process
All PRs are reviewed by the PhysicsNeMo team before merging.
Depending on which files are changed, GitHub may automatically assign a maintainer for review.
We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI's assessment of merge readiness and is not a qualitative judgment of your work, nor an indication that the PR will be accepted or rejected.
AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.